Most medical datasets suffer from missing data, due to the expense of
some tests or human error while recording them. This issue affects the
performance of machine learning models because the values of some
features will be missing. Therefore, there is a need for dedicated
methods for imputing these missing data. In this research, the salp swarm
algorithm (SSA) is used for generating and imputing the missing values in
the Pima Indian diabetes disease (PIDD) dataset; the proposed algorithm
is called ISSA. The obtained results showed that the classification
performance of three different classifiers, namely support vector machine
(SVM), K-nearest neighbour (KNN), and Naive Bayesian classifier (NBC),
was enhanced as compared to the dataset before applying the proposed
method. Moreover, the results indicated that ISSA performed better than
statistical imputation techniques such as deleting the samples with
missing values or replacing the missing values with zeros, the mean, or
random values.

1. INTRODUCTION

During data mining (DM) processes, the quality of the considered data determines the quality of the
outcome; hence, data pre-processing is an important step towards achieving clean, high-quality data and
determines the success of the mining process. Data pre-processing is the major step in the knowledge discovery
in databases (KDD) process, as it decreases data complexity and gives better conditions for subsequent data
analysis. Data pre-processing aids in understanding the nature of the data, thereby allowing accurate and
efficient data analysis. The next important element of KDD is the data itself. The input data must be prepared
in a format and structure that suit each DM task. Raw data is not expected to be perfect
without pre-processing. Since good DM models usually require well-structured data, the data quality must be
improved via thorough data cleansing. The data values must be correct and consistent, as missing data is a
major problem during DM processes, especially when occurring in large amounts; however, not all
attributes (instances) with missing values can be removed from the sample [1]-[3]. The problem of data loss
is particularly apparent in decision-making processes, especially in online applications where data must be
used exactly as it was generated. As a result, computational intelligence techniques such as neural networks
and other pattern recognition approaches have been used in current decision-making procedures. However,
decision-making processes cannot advance when some variables are not monitored, and the primary issue is
that traditional computational intelligence algorithms cannot successfully handle input data with missing
values (MVs) or conduct regression or classification tasks on such data [4], [5]. In most applications, finding a solution to the
missing data problem is a tiresome effort, and this is not considered in most decision-making tasks. As a
result, missing data is usually dealt with through quick and inefficient procedures. This
raises the demand for computational and mental resources, such as procedures and theoretical frameworks
that can lead to near-completion [6], [7]. Most of the time, inefficient tactics are used since there is not
enough time to identify better ways to cope with missing data at the time of observation; hence, ineffective
techniques like case deletion are used. Unfortunately, some widely used procedures cause more harm than good by
producing biased and incorrect results. The remainder of the paper is laid out as follows: the rest of this
section introduces missing values and the nature-inspired algorithms on which this work builds.
Section 2 outlines the proposed strategy and presents the Pima Indian diabetes disease (PIDD) data set
and its analysis. In section 3, the proposed imputation algorithm is assessed.
Finally, section 4 summarizes the proposed algorithm's outcomes.
- Missing data in medical datasets

Thinking about how the data points were lost in the first place is the simplest way to approach
missing data. The three mechanisms of missing data are missing completely at random, missing at random, and
non-ignorable [2], [3], [6], [7]. To begin, missing completely at random (MCAR) means that the probability
of a value being missing is unrelated to any data, observed or unobserved. Missing at random (MAR)
means that the missingness of some data points depends only on the observed data, not on the missing values
themselves. The non-ignorable type implies that the missingness depends on the missing values themselves
rather than being random. One of the easy ways of handling MVs is to delete the records that contain them
from the data set. However, this is not a good method when dealing with data that contains many records
with MVs, as it will result in bias during inference. In the presence of MVs, data analysis is a difficult
task that exposes the analyst to serious problems; if handled carelessly, MVs can lead to bias during
data analysis, cause ambiguous conclusions, and limit the generalizability of the study
outcome [8], [9].
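
To make the three mechanisms concrete, the following minimal sketch generates a missingness mask for a toy
two-column table under each mechanism. The column meanings (age, insulin), the thresholds, the drop
probabilities, the use of None as the missing-value marker, and all function names are illustrative
assumptions; none of them come from the PIDD study.

```python
import random

def mask_mcar(rows, p, rng):
    """MCAR: every insulin value is dropped with the same probability p."""
    return [rng.random() < p for _ in rows]

def mask_mar(rows, p_young, p_old, rng, age_cut=40):
    """MAR: the drop probability depends only on the fully observed age column."""
    return [rng.random() < (p_old if age >= age_cut else p_young)
            for age, _insulin in rows]

def mask_mnar(rows, p_low, p_high, rng, insulin_cut=150):
    """Non-ignorable: the drop probability depends on the insulin value itself."""
    return [rng.random() < (p_high if insulin >= insulin_cut else p_low)
            for _age, insulin in rows]

def apply_mask(rows, mask):
    """Replace masked insulin values by None (assumed missing-value marker)."""
    return [(age, None if hidden else insulin)
            for (age, insulin), hidden in zip(rows, mask)]

rng = random.Random(0)
rows = [(rng.randint(21, 80), rng.uniform(20, 300)) for _ in range(1000)]
for name, mask in [("MCAR", mask_mcar(rows, 0.3, rng)),
                   ("MAR", mask_mar(rows, 0.1, 0.5, rng)),
                   ("MNAR", mask_mnar(rows, 0.1, 0.5, rng))]:
    print(name, "missing fraction:", sum(mask) / len(rows))
```

Only in the MCAR case can the incomplete records be deleted without shifting the distribution of the
remaining values, which is why case deletion tends to bias the analysis in the other two cases.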
- Nature-inspired algorithms and salp swarm algorithm

The ability of nature-inspired metaheuristics to provide solutions to modern optimization problems
has attracted much research interest, especially their performance on NP-hard optimization
problems, such as the travelling salesman problem and feature selection [10]-[13]. One of the nature-inspired
metaheuristics commonly used in solving difficult optimization tasks is particle swarm optimization
(PSO), which was first developed in 1995 by Eberhart and Kennedy [14]. PSO was inspired by the
swarm behavior of natural species, such as the flocking of birds and the schooling of fish. PSO has found
application in different optimization fields, where it has performed excellently. The firefly algorithm (FA) is
another metaheuristic that has demonstrated good performance in many applications; it was developed by
Yang (2009). In these multiagent frameworks, the search mechanisms are governed by efficient local search,
randomization, and optimal solution selection. The randomization normally uses a uniform or
Gaussian distribution. Different types of nature-inspired algorithms (NIAs) have been proposed during the
past two decades. Most of them are inspired by a biological organism or social life, such as artificial bee
colony (ABC), ant colony optimizer (ACO), FA, grey wolf optimizer (GWO), ant lion optimizer (ALO), and nomadic
people optimizer (NPO) [15]-[20]. NIAs have played a great role in solving different types of optimization
problems in medical case studies [21]-[23], engineering [24]-[36], energy [37]-[48], and information security
[49]-[51]. The salp swarm algorithm (SSA) is a recent nature-inspired algorithm, inspired by the
cylindrical, jellyfish-like creatures that belong to the family Salpidae [52]-[55]. These creatures move
forward by pushing water backward. The swarming behavior of these salps inspired the
authors to propose SSA for solving difficult optimization problems. Figure 1 portrays the main shape and
salp chain in SSA. This chain is formulated mathematically by dividing the population into two different
groups: a leader and followers. The leader is responsible for leading the followers to better
positions. The position of the leader is updated as in (1):


$$
x_j^1 = \begin{cases}
F_j + c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 \geq 0 \\
F_j - c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 < 0
\end{cases} \tag{1}
$$


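Where $x_j^1$ denotes the position of the leader salp in the $j$-th dimension of the search space, $F_j$
denotes the position of the food source in that dimension, $ub_j$ and $lb_j$ denote the upper and
lower boundaries, and $c_1$, $c_2$, and $c_3$ represent three random numbers.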
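
As an illustration of the leader update in (1), the sketch below computes one new leader position
coordinate by coordinate. The function names are hypothetical, and the sampling ranges are assumptions:
the text only states that the three coefficients are random, so here c1 and c2 are drawn from [0, 1) and
c3 from [-1, 1] so that both branches of (1) can occur.

```python
import random

def leader_coordinate(f_j, lb_j, ub_j, c1, c2, c3):
    """Apply (1) to one dimension: step away from the food source F_j."""
    step = c1 * ((ub_j - lb_j) * c2 + lb_j)
    return f_j + step if c3 >= 0 else f_j - step

def update_leader(food, lb, ub, rng):
    """Return a new leader position around the food source.

    Assumption: c1, c2 ~ U(0, 1) and c3 ~ U(-1, 1); the paper does not give the ranges.
    """
    return [
        leader_coordinate(f_j, lb_j, ub_j,
                          rng.random(), rng.random(), rng.uniform(-1, 1))
        for f_j, lb_j, ub_j in zip(food, lb, ub)
    ]

rng = random.Random(0)
print(update_leader([0.5, 0.5], [0.0, 0.0], [1.0, 1.0], rng))
```

The result can fall outside the interval between the lower and upper boundaries, so a boundary check is
still needed after the update.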
2. METHOD

In this section, the proposed imputation algorithm based on SSA is presented. The first subsection
presents the proposed imputation algorithm in general, the second explains the SSA used in this study
in detail, and the third describes the dataset.


2.1. The proposed imputation algorithm 
To impute (estimate) the missing values in the targeted case study, an SSA-based imputation
algorithm is designed. The proposed algorithm consists of several stages, as follows.
Figure 1 shows the block diagram of the proposed algorithm.

a. Stage 1 dataset preparation. Preparing the data set is the first stage of the proposed algorithm. It
entails the following three steps: i) step 1 is to read the data set, ii) step 2 is to convert the data
from its original (.xlsx) format to a comma-separated values (.csv) file, which can be read by
practically any current programming language, and iii) step 3 is to normalize the dataset to the
range [0,1] using the min-max technique given in (2).


$$
N_y = \frac{X_y - Min}{Max - Min} \tag{2}
$$


Where $N_y$ represents the normalized value, while $X_y$ represents the original value. Min and Max denote the
minimum and maximum values of a specific feature, respectively.

b. Stage 2 the inputs. In this stage, the algorithmic parameters such as the size of the swarm, the maximum 
number of iterations, and other SSA controlling variables are entered. 

c. Stage 3 determine the positions of the missing values. In order to fill the missing values, their
positions in the dataset must be determined, together with their total number.
Based on these two pieces of information, the representation of each solution in the swarm is
structured: one decision variable per missing value.

d. Stage 4 SSA implementation. In this stage, the SSA is executed to search for the best values, which 
replace the missing values in the dataset. The main steps of SSA are given in the next subsection. 

e. Stage 5 evaluation. In this stage, the best solution obtained using SSA is evaluated in terms of 
classification accuracy, error rate, sensitivity, and specificity. 
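
Stages 1 and 3 can be sketched as follows. This is a minimal illustration rather than the authors'
MATLAB implementation: the dataset is assumed to be a list of rows, missing entries are assumed to be
marked with None, and the function names are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every column to [0, 1] as in (2), skipping missing entries (None)."""
    normalized = [list(row) for row in rows]
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        lo, hi = min(observed), max(observed)
        span = hi - lo
        for row in normalized:
            if row[j] is not None:
                # A constant column carries no information; map it to 0.0.
                row[j] = (row[j] - lo) / span if span else 0.0
    return normalized

def missing_positions(rows):
    """Return the (row, column) index of every missing entry, in row-major order."""
    return [(i, j) for i, row in enumerate(rows)
            for j, value in enumerate(row) if value is None]

def fill(rows, positions, solution):
    """Write one candidate solution into a copy of the data, one value per position."""
    completed = [list(row) for row in rows]
    for (i, j), value in zip(positions, solution):
        completed[i][j] = value
    return completed

data = [[148.0, 72.0], [85.0, None], [None, 64.0], [89.0, 66.0]]
norm = min_max_normalize(data)
positions = missing_positions(norm)
print(positions)
print(fill(norm, positions, [0.5, 0.25]))
```

With this representation, a solution in the swarm is simply a vector whose length equals the number of
missing entries, and each of its components lies in the normalized range [0,1].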
Figure 1. The block diagram of the proposed algorithm


2.2. SSA for missing values estimation 
The SSA method is organized into the following steps. Although not all of them are strictly
required, they help to implement the technique more efficiently.

a. Set the initial settings in the parameter vector [S.S MaxItr]. The upper bound (UB) and lower bound (LB)
values limit the search space. UB and LB values are assigned based on the case study, while the swarm size
(S.S) and the maximum number of iterations (MaxItr) are set according to different conditions.

b. Initialization: generate a random position for each solution in the swarm, drawn from the uniform
distribution as in (3):


$$
X_i = (UB - LB) \times Rand(0,1) + LB \tag{3}
$$


Bulletin of Electr Eng & Inf, ISSN: 2302-9285, Vol. 12, No. 3, June 2023: 1700-1710


Where Rand(0,1) is a randomization method that generates a random value in the range [0,1].

c. Fitness function: each generated (or estimated) solution is evaluated via the classification accuracy
(A). The accuracy is computed via K-fold cross validation, where K is equal to 5. Three different
classification models are used in this study: K-nearest neighbour (KNN), support vector machine (SVM),
and Naive Bayesian classifier (NBC).

d. Position updating: move each follower $X_i$ towards a leading salp $X_j$ with better fitness,
i.e., $I(F_i) < I(F_j)$, via (4):


$$
x_j^i = \frac{1}{2}\left(x_j^i + x_j^{i-1}\right) \tag{4}
$$


Where $i \geq 2$ and $x_j^i$ denotes the position of the $i$-th follower in the $j$-th dimension. Check the
boundary limits: check whether the values obtained in the new position of the solution are within the search
space or not, as in (5):

$$
x_j^i = \begin{cases}
LB, & \text{if } x_j^i < LB \\
UB, & \text{if } x_j^i > UB \\
x_j^i, & \text{otherwise}
\end{cases} \tag{5}
$$


Then, $x^i$ is evaluated using the fitness function explained in step c.


e. Sorting and ranking: after updating the positions of all salps, the swarm is sorted and ranked based on
the fitness value. Obtain the leader ($X_{Best}$) from the swarm (which will always be the topmost solution
after sorting), and compare every other solution in X against it.

f. Stop condition: the first and second steps are executed only once, while the remaining steps (c-f) are
iterated t times. That is, the algorithm checks whether t is still less than MaxItr, which was set
in the first step; if so, it returns to step c. Otherwise, it exits the loop and returns the last $F_{Best}$.
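
Steps a to f can be put together in a compact sketch. It is an illustration under stated assumptions and
not the authors' MATLAB code: the fitness function is passed in as a callable (in the paper it would be the
5-fold cross-validation accuracy of a classifier on the completed dataset, here a toy function stands in),
c3 is drawn from [-1, 1] so that both branches of (1) occur, and instead of sorting the whole swarm as in
step e the best solution found so far is tracked directly, which gives the same leader reference.

```python
import random

def ssa_impute(fitness, dim, lb=0.0, ub=1.0, swarm_size=10, max_itr=25, seed=0):
    """Search for `dim` replacement values in [lb, ub] that maximize `fitness`."""
    rng = random.Random(seed)
    # Step b: uniform random initialization, as in (3).
    swarm = [[(ub - lb) * rng.random() + lb for _ in range(dim)]
             for _ in range(swarm_size)]
    # Steps c and e: evaluate the swarm and keep a copy of the best solution.
    best = list(max(swarm, key=fitness))
    best_fit = fitness(best)
    for _ in range(max_itr):
        for i in range(swarm_size):
            if i == 0:
                # Leader update, as in (1); c3 ~ U(-1, 1) is an assumption.
                for j in range(dim):
                    c1, c2, c3 = rng.random(), rng.random(), rng.uniform(-1, 1)
                    step = c1 * ((ub - lb) * c2 + lb)
                    swarm[0][j] = best[j] + step if c3 >= 0 else best[j] - step
            else:
                # Follower update, as in (4): midpoint with the salp ahead in the chain.
                swarm[i] = [(a + b) / 2 for a, b in zip(swarm[i], swarm[i - 1])]
            # Boundary check, as in (5).
            swarm[i] = [min(max(x, lb), ub) for x in swarm[i]]
            fit = fitness(swarm[i])
            if fit > best_fit:
                best, best_fit = list(swarm[i]), fit
    return best, best_fit

def toy_fitness(solution, target=(0.3, 0.7)):
    """Stand-in for the classifier accuracy: closer to `target` is better."""
    return -sum((s - t) ** 2 for s, t in zip(solution, target))

print(ssa_impute(toy_fitness, dim=2))
```

For the imputation task, `dim` would be the number of missing entries, and the fitness callable would
write the candidate vector into the dataset before training and scoring the chosen classifier.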


2.3. PIDD dataset

The data was first gathered by the National Institute of Diabetes and Digestive and Kidney Diseases.
During the investigation, the World Health Organization's (WHO) recommendations were followed. All
participants are females at least 21 years old and of Pima Native American descent. This data set
has already been used by multiple researchers to develop classification algorithms; thus, it was chosen for
this study so that the results could be compared to other current PIDD diagnosis investigations. This data
set contains 768 examples, each described by eight features. Table 1 lists all the features in this data set,
along with their types.

The last value, a binary label, was used for the classification task; it partitions the data into 2 classes,
which are class zero (non-diabetic) and class one (diabetic). The first 8 features in the dataset served as
the input, while the last value served as the ground truth. There are 268 diabetic cases (34.90%) in the
dataset, while non-diabetic cases account for 65.10% (500 cases). Missing data is a standard issue in most
medical case studies, for two main reasons. First, some of the medical tests are beyond the budget of the
patients, so they cannot afford them. Second, sometimes the values are not recorded correctly due to
time constraints. These missing values may affect the classification performance. The PIDD dataset is also
associated with a large percentage of missing data, as shown in Table 2. All the features contain missing
values, except the first feature.


Table 1. The features set in the dataset

F   Name                           Type
1   No. of times pregnant          Numeric
2   Plasma glucose concentration   Numeric
3   Diastolic blood pressure       Numeric (mmHg)
4   Triceps skin fold thickness    Numeric (mm)
5   2 hours serum insulin          Numeric (μU/ml)
6   Body mass index                Numeric (kg/m²)
7   Diabetes pedigree function     Numeric
8   Age                            Numeric (years)


Table 2. Information about missing values in the dataset

F   Name                           Missing values
1   No. of times pregnant          -
2   Plasma glucose concentration   5
3   Diastolic blood pressure       35
4   Triceps skin fold thickness    227
5   2 hours serum insulin          374
6   Body mass index                11
7   Diabetes pedigree function     1
8   Age                            63
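
The share of missing entries per feature follows directly from the counts in Table 2 and the 768 samples.
The short sketch below reproduces this arithmetic together with the class percentages quoted above; the
dictionary layout and the helper name `percent` are illustrative choices.

```python
N_SAMPLES = 768

# Missing-value counts as reported in Table 2.
missing_counts = {
    "Plasma glucose concentration": 5,
    "Diastolic blood pressure": 35,
    "Triceps skin fold thickness": 227,
    "2 hours serum insulin": 374,
    "Body mass index": 11,
    "Diabetes pedigree function": 1,
    "Age": 63,
}

def percent(count, total=N_SAMPLES):
    """Share of `count` in `total`, as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)

for name, count in missing_counts.items():
    print(f"{name}: {count} missing ({percent(count):.2f}%)")

print(f"diabetic: {percent(268):.2f}%, non-diabetic: {percent(500):.2f}%")
```

Serum insulin (about 48.70% missing) and triceps skin fold thickness (about 29.56% missing) dominate, which
is why deleting incomplete records would discard a large part of the dataset.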


3. RESULTS AND DISCUSSION 
3.1. Experimental settings 

To evaluate the performance of the proposed imputation algorithm, a set of experiments was carried
out, each with different test settings. The imputation algorithm was written and executed in MATLAB,
version 2018b, under Windows 10 on a machine with a 2.6 GHz 64-bit CPU and 8 GB of RAM. The
settings of the experiments depend mainly on the structural parameters, which are the number of
iterations (ITR) and the number of solutions in the swarm (N). To assess the effect of these two
parameters on the performance of the algorithm, several values of each one are tested, as follows:

a. Case 1 based on N. Changing the number of solutions has an impact on the performance of any
nature-inspired optimization algorithm. A larger N sometimes enhances the performance, but it may
also slow the algorithm down. Therefore, to determine the best N, several tests
are performed with N = {10, 15, 20, 30}.

b. Case 2 based on ITR. The number of iterations also has an impact on the performance of optimization
algorithms. To determine the best possible ITR, several tests are performed where
ITR = {25, 50, 100, 200}.

c. Case 3 based on classifier. As explained in the previous section, the fitness function of the proposed
imputation algorithm depends on three different classifiers. In other words, there are three different
versions of the proposed imputation algorithm: imputation SSA with KNN (SSA-KNN), imputation SSA with
SVM (SSA-SVM), and imputation SSA with NBC (SSA-NBC).

The settings of the tests are summarized in Table 3; each test was executed 10 times. The
results obtained from each test are:

a. Beginning accuracy (B. Acc). Represents the accuracy obtained on the original dataset with missing
values.

b. K-fold cross validation accuracy (CV. Acc). Represents the accuracy obtained using the proposed
imputation algorithm.

c. Original holdout accuracy (OR. Acc). Represents the accuracy obtained by the different classifiers on
the original dataset, when the dataset is divided into a training set (65%) and a testing set (35%).

d. Optimized holdout accuracy (OP. Acc). Represents the accuracy obtained by the different classifiers
on the enhanced dataset, when the enhanced dataset is divided into a training set (65%) and a testing
set (35%).
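
The two validation schemes behind these accuracies can be sketched as follows. This is a minimal
pure-Python illustration and not the MATLAB routines used in the study: the small KNN classifier, the
choice k = 3, the shuffling seed, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def accuracy(x, y, train_idx, test_idx, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    train_x = [x[i] for i in train_idx]
    train_y = [y[i] for i in train_idx]
    hits = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in test_idx)
    return hits / len(test_idx)

def cross_val_accuracy(x, y, folds=5, k=3, seed=0):
    """Mean accuracy over `folds` disjoint test folds (the CV accuracy)."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        test_idx = idx[f::folds]  # every folds-th shuffled index forms one fold
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        scores.append(accuracy(x, y, train_idx, test_idx, k))
    return sum(scores) / folds

def holdout_accuracy(x, y, train_share=0.65, k=3, seed=0):
    """Accuracy on the last 35% of a shuffled split, training on the first 65%."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    cut = round(train_share * len(idx))
    return accuracy(x, y, idx[:cut], idx[cut:], k)

# Toy data: two well-separated clusters, one per class.
x = [[0.01 * i, 0.0] for i in range(10)] + [[1.0 + 0.01 * i, 1.0] for i in range(10)]
y = [0] * 10 + [1] * 10
print(cross_val_accuracy(x, y), holdout_accuracy(x, y))
```

In the experiments, the cross-validation accuracy serves as the fitness during the search, while the
holdout accuracy is used afterwards to compare the original and the enhanced datasets.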


Table 3. Tests settings


Test   N    ITR
T1     10   25
T2     10   50
T3     10   100
T4     10   200
T5     15   25
T6     15   50
T7     15   100
T8     15   200
T9     20   25
T10    20   50
T11    20   100
T12    20   200
T13    30   25
T14    30   50
T15    30   100
T16    30   200
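
The sixteen tests are the full grid of the four swarm sizes and the four iteration counts. A short
sketch of how such a grid can be enumerated is given below; the dictionary layout and the constant names
are illustrative choices.

```python
from itertools import product

SWARM_SIZES = [10, 15, 20, 30]     # values of N
ITERATIONS = [25, 50, 100, 200]    # values of ITR

# product() varies the last factor fastest, which matches the row order of Table 3.
tests = {
    f"T{number}": {"N": n, "ITR": itr}
    for number, (n, itr) in enumerate(product(SWARM_SIZES, ITERATIONS), start=1)
}

for name, setting in tests.items():
    print(name, setting["N"], setting["ITR"])
```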


3.2. Obtained results 
3.2.1. Results obtained using KNN as a fitness function 

In this part, the KNN classification model is used for measuring the fitness of each solution in the
swarm. The results of this experiment were obtained based on all tests [T1 - T16] listed in Table 3, where
each test was run ten times. The average results of each test are summarized in
Figures 2 and 3. Figure 2 illustrates the results obtained using cross validation on the original and the
optimized datasets, while Figure 3 compares the holdout results of the
three classifiers.

It can be seen from Figures 2 and 3 that the proposed imputation algorithm with the KNN
model as a fitness function enhanced the results. In other words, the proposed algorithm estimated and
filled the missing values in the PIDD dataset with values better suited to the prediction and classification
process. In addition, Figure 3 shows that the KNN model had the best performance when it was used for the
validation of the generated dataset, as compared to the other two classifiers. However, SVM performed very
close to KNN, while the performance of NBC was the worst.
Figure 2. Comparison between average results of the obtained accuracies
Figure 3. Comparison between the average accuracies using holdout


3.2.2. Results obtained using SVM as a fitness function 
In this experiment, the SVM classification model is used for evaluating the solutions in the swarm. The
experiments were validated based on the tests listed in Table 3. Ten runs were
performed, and the average of these runs for each test is presented in Figures 4 and 5. The figures show
different results as compared to the previous experiment, as the SVM in Figure 5 shows superior
performance. SVM ranked first, while NBC ranked third and attained the worst performance, just as in the
previous experiment. On the other hand, the results obtained in this experiment were
much better than the results obtained using the original dataset with missing values (see Figure 6).
Figure 4. Comparison between average results of the obtained accuracies


3.2.3. Results obtained using NBC as a fitness function

In the final experiment, the NBC classifier is used for evaluating the generated datasets, i.e., the
solutions in the swarm. The algorithm was run ten times for each of the tests listed in Table 3,
and the average of these runs is presented in Figures 6 and 7.

Figures 6 and 7 show that NBC has the worst performance as compared to the other two classifiers.
Moreover, the accuracy obtained on the dataset filled using the proposed
imputation algorithm was better than on the original dataset in all tests. Therefore, the proposed
algorithm with NBC still enhances the performance in general, but with worse results as compared to the
other classifiers.

In the previous subsections, it was clear that the proposed SSA imputation algorithm, with all
classifiers, was able to handle the problem of the missing values in the PIDD dataset. Even the worst
performance of the NBC classifier was better than the best performance of all tests based on the original
dataset. Moreover, three observations can be summarized as follows:

a. When KNN was used as a fitness function, the holdout validation experiments showed that the KNN
classifier on the 35% testing set was better than the other classifiers. However, KNN ranked second when
SVM or NBC was used as the fitness function. In general, SVM showed the best performance due to the
sequential minimal optimization (SMO) algorithm for tuning C and γ in the RBF kernel function.

b. All the results obtained using SVM and KNN were above 77%, while the results obtained using NBC
were in the range [70%, 75%].

c. It can be seen from the cross-validation experiments that the results were better when the number of
solutions (the swarm size) was increased (i.e., tests T10 - T16), meaning that the number of solutions has
an obvious impact on the searching performance of SSA. On the other hand, the number of iterations (ITR)
has less impact on SSA.

The evaluation measurements other than the classification accuracy (listed in section 2.1) are
presented in Table 4. In the previous subsections, the proposed SSA imputation algorithm based on
different classifiers was evaluated. The evaluation process depended mainly on sixteen tests and two
validation methods: cross validation and holdout. In this section, the proposed imputation algorithm is
benchmarked against four well-known imputation approaches on the PIDD dataset. These
approaches are:

a. A1: removing the entire row that contains missing values. This approach decreases the
amount of training data, which may affect the classification process.

b. A2: replacing the missing values with zeros. In some cases, this could be a good solution; however, the
value of zero may also affect the classification process when the classification model is trained on
the modified data.

c. A3: replacing the missing values by the average (mean) of the other values of the attribute. In most
cases, this approach is better than the previous approaches because the generated values depend mainly on
the other values of the same attribute.

d. A4: replacing the missing values by random values in the range [0,1]. However, this method may generate
values that affect the classification models. In other words, the values may introduce noise or change the
distribution of the samples.
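
The four baseline approaches can be sketched in a few lines each. The sketch assumes a normalized dataset
stored as a list of rows with None marking a missing entry; class labels are left out for brevity (for A1
they would have to be dropped together with their rows), and the function names are hypothetical.

```python
import random

def drop_rows(rows):
    """A1: keep only the rows without any missing value."""
    return [list(row) for row in rows if None not in row]

def fill_zero(rows):
    """A2: replace every missing value with zero."""
    return [[0.0 if v is None else v for v in row] for row in rows]

def fill_mean(rows):
    """A3: replace every missing value with the mean of its column's observed values."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def fill_random(rows, seed=0):
    """A4: replace every missing value with a uniform random value in [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() if v is None else v for v in row] for row in rows]

data = [[0.2, None], [0.4, 0.6], [None, 1.0]]
for approach in (drop_rows, fill_zero, fill_mean, fill_random):
    print(approach.__name__, approach(data))
```

None of these baselines looks at the classifier, whereas the proposed algorithm selects the replacement
values according to the classification accuracy they produce.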

The approaches above were integrated with the three classifiers used in this study and executed ten
times. Then, the best, the mean, and the standard deviation were recorded. Table 5 presents the comparison
of the four approaches against the proposed imputation algorithm (labelled IFA-KNN, IFA-SVM, and IFA-NBC).
In addition to the mentioned approaches, the classification accuracy on the dataset without any imputation
approach is also presented.
Figure 6. Comparison between average results of the obtained accuracies
Figure 7. Comparison between the average accuracies using holdout


Table 4. Evaluation measurements 


Algorithm Sensitivity Specificity MSE 

IFA-KNN 0.520583 0.862484 0.159723 
IFA-SVM 0.532583 0.873494 0.154563 
IFA-NBC 0.489166 0.762887 0.161783 


Table 5. Comparison against other imputation approaches


Classifier   Approach   Best       Mean      Std. Dev
KNN          Original   0.75008    0.75008   0
             A1         0.73641    0.73122   0.24782
             A2         0.75421    0.75231   0.21412
             A3         0.76822    0.76741   0.19321
             A4         0.76025    0.75942   0.20411
             IFA        0.794153   0.78421   0.18695
SVM          Original   0.77935    0.77935   0
             A1         0.75982    0.75611   0.22782
             A2         0.76724    0.76514   0.21842
             A3         0.77942    0.77862   0.20142
             A4         0.77834    0.77285   0.19782
             IFA        0.790758   0.78793   0.002744
NBC          Original   0.70414    0.70414   0
             A1         0.69842    0.69215   0.25413
             A2         0.69624    0.69342   0.24821
             A3         0.70128    0.70101   0.20421
             A4         0.70431    0.70321   0.20142
             IFA        0.73348    0.72569   0.00754


The proposed imputation algorithm obtained the highest results as compared to the
other approaches. A1 attained the worst position with all classifiers, because in this approach many
samples are deleted from the dataset, which shrinks the training set. The second approach, A2, had almost
the same performance, with slightly better results due to using zero as the value for all missing data. On
the other hand, the third and fourth approaches, A3 and A4, were better than the previous approaches because
they fill the missing data with mean or random values. The generated values are better than using zero or
removing the samples with missing data. Moreover, the best
results were obtained using IFA-KNN; however, IFA-SVM has better average results. The standard
deviation showed that both IFA-SVM and IFA-NBC are more stable than IFA-KNN.


4. CONCLUSION 

Missing data (missing values) is an issue in most medical datasets. It occurs for two
main reasons: a) the expense of the medical tests and b) the failure to record all the features due to time
constraints or human error. Therefore, there is a need for a specific process for repairing these missing
data; this process is called imputation. In this research, SSA is used as an imputation method. Three
different classifiers are used for evaluating the generated missing values: KNN, SVM, and NBC.
The proposed imputation algorithm was evaluated in two main experiments. The first used cross
validation with 5 folds, while the second used the holdout
validation method, where the generated dataset was divided into a training set (65%) and a testing set
(35%). The results showed that the proposed imputation algorithm could estimate the missing values in PIDD
and enhanced the classification accuracy for all classifiers. SVM ranked the best, while NBC ranked
the worst.